Regression Analysis using R

BOSTON HOUSING Dataset

Loading Libraries

In [2]:
library("tidyverse")
library("GGally")
library("corrplot")
library("reshape2")
library("plotly")
library(MASS)
library(ggplot2)   # plot
library(calibrate) # textxy
library(car)       #vif, durbin-watson test
library(orcutt)    # to fix auto-correlation issues
library(leaps)     #regsubset
library(qpcR)      #To calculate PRESS Residual and PRESS Statistic
library(factoextra)# PCA graphs
library(nortest)   # for normality testing - Anderson-Darling etc
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

 ggplot2 3.3.5      purrr   0.3.4
 tibble  3.1.6      dplyr   1.0.8
 tidyr   1.2.0      stringr 1.4.0
 readr   2.1.2      forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()


Increasing the resolution of the plots:

In [2]:
options(repr.plot.height=8, repr.plot.res=300)

Loading the Dataset

In [3]:
boston_dataset <- read_csv("../datasets/BostonHousing.csv")
Rows: 506 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (14): CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LS...

 Use `spec()` to retrieve the full column specification for this data.
 Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Exploration And Cleaning

Glance of the Dataset

In [4]:
boston_dataset <- boston_dataset %>% 
        dplyr::select(MEDV, everything())
head(boston_dataset)
A tibble: 6 × 14 (all columns <dbl>)

MEDV  CRIM     ZN  INDUS  CHAS  NOX    RM     AGE   DIS     RAD  TAX  PTRATIO  B       LSTAT
24.0  0.00632  18  2.31   0     0.538  6.575  65.2  4.0900  1    296  15.3     396.90  4.98
21.6  0.02731   0  7.07   0     0.469  6.421  78.9  4.9671  2    242  17.8     396.90  9.14
34.7  0.02729   0  7.07   0     0.469  7.185  61.1  4.9671  2    242  17.8     392.83  4.03
33.4  0.03237   0  2.18   0     0.458  6.998  45.8  6.0622  3    222  18.7     394.63  2.94
36.2  0.06905   0  2.18   0     0.458  7.147  54.2  6.0622  3    222  18.7     396.90  5.33
28.7  0.02985   0  2.18   0     0.458  6.430  58.7  6.0622  3    222  18.7     394.12  5.21

Features of the data:

In [5]:
names(boston_dataset)
  1. 'MEDV'
  2. 'CRIM'
  3. 'ZN'
  4. 'INDUS'
  5. 'CHAS'
  6. 'NOX'
  7. 'RM'
  8. 'AGE'
  9. 'DIS'
  10. 'RAD'
  11. 'TAX'
  12. 'PTRATIO'
  13. 'B'
  14. 'LSTAT'

The following information comes from the official site: https://archive.ics.uci.edu/ml/machine-learning-databases/housing

There are 14 attributes in each case of the dataset. They are:

  • CRIM - per capita crime rate by town
  • ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS - proportion of non-retail business acres per town.
  • CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  • NOX - nitric oxides concentration (parts per 10 million)
  • RM - average number of rooms per dwelling
  • AGE - proportion of owner-occupied units built prior to 1940
  • DIS - weighted distances to five Boston employment centres
  • RAD - index of accessibility to radial highways
  • TAX - full-value property-tax rate per $10,000
  • PTRATIO - pupil-teacher ratio by town
  • B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  • LSTAT - %lower status of the population
  • MEDV - Median value of owner-occupied homes in $1000's

MEDV, the price of the houses, is our target variable.

Dimension of the data:

In [6]:
dim(boston_dataset)
  1. 506
  2. 14

So our data has 506 observations of 14 columns.

Datatypes of columns:

In [7]:
str(boston_dataset)
tibble [506 × 14] (S3: tbl_df/tbl/data.frame)
 $ MEDV   : num [1:506] 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
 $ CRIM   : num [1:506] 0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ ZN     : num [1:506] 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ INDUS  : num [1:506] 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ CHAS   : num [1:506] 0 0 0 0 0 0 0 0 0 0 ...
 $ NOX    : num [1:506] 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ RM     : num [1:506] 6.58 6.42 7.18 7 7.15 ...
 $ AGE    : num [1:506] 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ DIS    : num [1:506] 4.09 4.97 4.97 6.06 6.06 ...
 $ RAD    : num [1:506] 1 2 2 3 3 3 5 5 5 5 ...
 $ TAX    : num [1:506] 296 242 242 222 222 222 311 311 311 311 ...
 $ PTRATIO: num [1:506] 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ B      : num [1:506] 397 397 393 395 397 ...
 $ LSTAT  : num [1:506] 4.98 9.14 4.03 2.94 5.33 ...

We can see that all the features are already in numeric (double) format.

Summary of the data:

In [8]:
summary(boston_dataset)
      MEDV            CRIM                ZN             INDUS      
 Min.   : 5.00   Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46  
 1st Qu.:17.02   1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19  
 Median :21.20   Median : 0.25651   Median :  0.00   Median : 9.69  
 Mean   :22.53   Mean   : 3.61352   Mean   : 11.36   Mean   :11.14  
 3rd Qu.:25.00   3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10  
 Max.   :50.00   Max.   :88.97620   Max.   :100.00   Max.   :27.74  
      CHAS              NOX               RM             AGE        
 Min.   :0.00000   Min.   :0.3850   Min.   :3.561   Min.   :  2.90  
 1st Qu.:0.00000   1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02  
 Median :0.00000   Median :0.5380   Median :6.208   Median : 77.50  
 Mean   :0.06917   Mean   :0.5547   Mean   :6.285   Mean   : 68.57  
 3rd Qu.:0.00000   3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08  
 Max.   :1.00000   Max.   :0.8710   Max.   :8.780   Max.   :100.00  
      DIS              RAD              TAX           PTRATIO     
 Min.   : 1.130   Min.   : 1.000   Min.   :187.0   Min.   :12.60  
 1st Qu.: 2.100   1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40  
 Median : 3.207   Median : 5.000   Median :330.0   Median :19.05  
 Mean   : 3.795   Mean   : 9.549   Mean   :408.2   Mean   :18.46  
 3rd Qu.: 5.188   3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20  
 Max.   :12.127   Max.   :24.000   Max.   :711.0   Max.   :22.00  
       B              LSTAT      
 Min.   :  0.32   Min.   : 1.73  
 1st Qu.:375.38   1st Qu.: 6.95  
 Median :391.44   Median :11.36  
 Mean   :356.67   Mean   :12.65  
 3rd Qu.:396.23   3rd Qu.:16.95  
 Max.   :396.90   Max.   :37.97  

In the above summary we can see the helpful statistics such as min, max, mean, median and quartiles for all of the columns.

Data Cleaning

Checking Missing Values:

In [7]:
colSums(is.na(boston_dataset))
MEDV
0
CRIM
0
ZN
0
INDUS
0
CHAS
0
NOX
0
RM
0
AGE
0
DIS
0
RAD
0
TAX
0
PTRATIO
0
B
0
LSTAT
0

Inference: There are no NULL or missing values in our data. We will investigate this further.
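The same check can be collapsed into a single call. A minimal sketch, demonstrated on R's built-in airquality data (which does contain NAs) since it must run standalone; in this notebook the argument would be boston_dataset:

```r
# A one-shot missing-value check (sketch on built-in airquality data;
# in the notebook the argument would be boston_dataset):
anyNA(airquality)        # TRUE if any value is missing
sum(is.na(airquality))   # total count of missing cells
```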

Data Visualization

Pairwise features relation

In [25]:
options(repr.plot.height=12)
ggpairs(boston_dataset)

In the above ggpairs plot we can see the correlation, density curve and scatterplot for each pair of features simultaneously. Below we will look at plots focused specifically on the correlations and density curves, then derive inferences.

Inference:

  • Most of the variables are not normal.
In [24]:
options(repr.plot.height=8)
corrplot.mixed(cor(boston_dataset), lower="number", upper="circle",tl.pos = "lt")

Inference: From the above data we can see the correlation between features. We can infer:

  • (INDUS, NOX), (AGE, NOX), (TAX, INDUS), (TAX, RAD) and (MEDV, RM) pairs are highly positively correlated.
  • (DIS, INDUS), (DIS, NOX), (DIS, AGE) and (MEDV, LSTAT) pairs are highly negatively correlated.
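These pairs can also be pulled out programmatically rather than read off the plot. A sketch; the 0.8 cutoff and the use of the built-in mtcars data are illustrative choices, and in this notebook the same function would be applied to boston_dataset:

```r
# Sketch: list feature pairs whose absolute correlation exceeds a cutoff.
high_cor_pairs <- function(df, cutoff = 0.8) {
  cm <- cor(df)
  cm[upper.tri(cm, diag = TRUE)] <- NA        # keep each pair only once
  idx <- which(abs(cm) > cutoff, arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, 1]],
             var2 = colnames(cm)[idx[, 2]],
             r    = round(cm[idx], 3))
}
high_cor_pairs(mtcars)
```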

Because MEDV is our target variable and we see two features highly correlated with it, we are interested in their joint density plots. They are given below:

In [12]:
options(repr.plot.height=5, repr.plot.res=200)
# devtools::install_github("WinVector/WVPlots")
suppressMessages(library(WVPlots))
suppressWarnings( ScatterHist( boston_dataset, "MEDV", "LSTAT", title="Joint Density plot of MEDV - LSTAT"))
suppressWarnings( ScatterHist( boston_dataset, "MEDV", "RM", title="Joint Density plot of MEDV - RM"))

Inference:

  • MEDV is highly positively correlated with RM.
  • MEDV is highly negatively correlated with LSTAT.

3D Scatterplot for LSTAT, RM and MEDV:

In [8]:
library(rgl)
# Plot
plot3d( 
  x=boston_dataset$LSTAT, y=boston_dataset$RM, z=boston_dataset$MEDV, 
#   col = data$CHAS, 
  type = 's', 
  radius = .1,
  xlab="LSTAT", ylab="RM", zlab="MEDV")

# To display in an R Markdown document:
rglwidget()

# To save to a file:
 htmlwidgets::saveWidget(rglwidget(width = 800, height = 800), 
                        file = "3D-Scatterplot-MEDV-LSTAT-RM.html",
                        libdir = "libs",
                        selfcontained = FALSE
                        )

Inference:

  • The points in the 3D scatterplot do not lie on a plane.
  • There isn't a clear linear relationship among LSTAT, RM and MEDV.

Checking the outliers and density curve:

In [14]:
#setting the resolution of the plots
par(mfrow=c(1,3))
options(repr.plot.height=3,repr.plot.width=15, repr.plot.res=150)

# a function to plot the density curve, histogram and boxplot of a feature simultaneously.
powerPlot <- function(x, name)
{
    dens <- density(x)
    # Histogram
    hist(x, probability = TRUE, xlab = name, ylab="", ylim= c(0, max(dens$y)), col = "grey",
        axes = FALSE, main = "", cex.lab=2)

    # Axis
    axis(1, cex.axis=2)

    # Density
    lines(dens, col = "red", lwd = 1.5)

    # Add boxplot
    par(new = TRUE)
    boxplot(x, horizontal = TRUE, axes = FALSE,
        lwd = 2, col = rgb(0, 1, 1, alpha = 0.15), outcol = "red")  # outcol, not ggplot's outlier.colour
}


# a loop to run powerPlot() for all of the features
for (x in names(boston_dataset)){
    powerPlot( boston_dataset[[x]], x)
}

Inference:

  • None of the features is perfectly normally distributed.
  • We can see that CRIM, ZN, RM, B and MEDV have outliers.
  • Also, we note that there isn't any single transformation that fits all features.

Relation of MEDV against other features:

In [15]:
options(repr.plot.height=12, repr.plot.res=200)

rel <- boston_dataset %>%
    melt(id.vars = "MEDV") %>%
    ggplot(aes(x = value, y = MEDV, colour = variable), na.rm=TRUE) +
    geom_point(alpha = 0.7, na.rm=TRUE) +
    stat_smooth(aes(colour = "black"), na.rm=TRUE) +
    facet_wrap(~variable, scales = "free", ncol = 2) +
    labs(x = "Variable Value", y = "Median House Price ($1000s)") +
    theme_minimal( base_size=18) +
    theme( legend.text = element_text( size=14))

suppressMessages(suppressWarnings( print(rel)))

Inference:

  • There is no perfect linear relationship between MEDV and any other single feature.
  • Most of the relationships are non-linear.
  • CHAS is a categorical feature, with most observations at CHAS=0.

Density Curve of Target Variable (MEDV - Price of the houses)

In [32]:
options(repr.plot.height=5, repr.plot.res=200)
boston_dataset %>% 
    ggplot(aes(MEDV)) +
    stat_density(fill="white", col="blue") +
    theme_minimal( base_size=18) +
    ggtitle("MEDV-Price of houses Density curve") + theme( plot.title = element_text( hjust=0.5))

Inference:

  • The target variable is not normal.

Checking Outliers

In [33]:
count=0
outliers=c()
for(i in 1:nrow(boston_dataset)){
    x <- boston_dataset[i,]
    for(col in names(boston_dataset)){
        mu=mean(boston_dataset[[col]])
        sigma=sd(boston_dataset[[col]])
        if(x[[col]]<mu-3*sigma || x[[col]]>mu+3*sigma)
        {
            count=count+1
        }
    }
    if(count>0){
        outliers <- c(outliers, i)
    }
    count=0
}
cat("No. of Outliers:",length(outliers))
No. of Outliers: 91
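The nested loop above can be written more compactly with scale(), which standardizes each column to (x - mean) / sd. A standalone sketch on synthetic data (so that it runs on its own); in the notebook `df` would be boston_dataset:

```r
# Vectorized 3-sigma outlier detection (sketch on synthetic data).
set.seed(1)
df <- as.data.frame(matrix(rnorm(506 * 14), ncol = 14))
z  <- scale(df)                                 # column-wise z-scores
outlier_rows <- which(rowSums(abs(z) > 3) > 0)  # rows with any |z| > 3
length(outlier_rows)
```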

MODELLING

Feature Engineering

Backward Elimination

In [34]:
## backward elimination
BER=regsubsets(MEDV~.,data=boston_dataset,method="backward")
Modelsummary=cbind(summary(BER)$which,R2=summary(BER)$rsq,SSres=summary(BER)$rss,AdjR2=summary(BER)$adjr2,Cp=summary(BER)$cp,BIC=summary(BER)$bic)
Modelsummary
A matrix: 8 × 19 of type dbl
Size  Variables selected (plus intercept)         R2         SSres     AdjR2      Cp         BIC
1     LSTAT                                       0.5441463  19472.38  0.5432418  362.75295  -385.0521
2     RM, LSTAT                                   0.6385616  15439.31  0.6371245  185.64743  -496.2582
3     RM, PTRATIO, LSTAT                          0.6786242  13727.99  0.6767036  111.64889  -549.4767
4     RM, DIS, PTRATIO, LSTAT                     0.6903077  13228.91  0.6878351   91.48526  -561.9884
5     NOX, RM, DIS, PTRATIO, LSTAT                0.7080893  12469.34  0.7051702   59.75364  -585.6823
6     NOX, RM, DIS, PTRATIO, B, LSTAT             0.7153894  12157.51  0.7119672   47.90533  -592.2707
7     NOX, RM, DIS, RAD, PTRATIO, B, LSTAT        0.7187396  12014.40  0.7147861   43.55007  -592.0357
8     CRIM, NOX, RM, DIS, RAD, PTRATIO, B, LSTAT  0.7239766  11790.70  0.7195336   35.61547  -595.3196

Significant Features: CRIM, NOX, RM, DIS, RAD, PTRATIO, B, LSTAT

Forward Selection

In [35]:
## forward selection
FSR=regsubsets(MEDV~.,data=boston_dataset,method="forward")
# summary(FSR)
Modelsummary=cbind(summary(FSR)$which,R2=summary(FSR)$rsq,SSres=summary(FSR)$rss,AdjR2=summary(FSR)$adjr2,Cp=summary(FSR)$cp,BIC=summary(FSR)$bic)
Modelsummary
A matrix: 8 × 19 of type dbl
Size  Variables selected (plus intercept)         R2         SSres     AdjR2      Cp         BIC
1     LSTAT                                       0.5441463  19472.38  0.5432418  362.75295  -385.0521
2     RM, LSTAT                                   0.6385616  15439.31  0.6371245  185.64743  -496.2582
3     RM, PTRATIO, LSTAT                          0.6786242  13727.99  0.6767036  111.64889  -549.4767
4     RM, DIS, PTRATIO, LSTAT                     0.6903077  13228.91  0.6878351   91.48526  -561.9884
5     NOX, RM, DIS, PTRATIO, LSTAT                0.7080893  12469.34  0.7051702   59.75364  -585.6823
6     CHAS, NOX, RM, DIS, PTRATIO, LSTAT          0.7157742  12141.07  0.7123567   47.17537  -592.9553
7     CHAS, NOX, RM, DIS, PTRATIO, B, LSTAT       0.7221614  11868.24  0.7182560   37.05889  -598.2295
8     ZN, CHAS, NOX, RM, DIS, PTRATIO, B, LSTAT   0.7266079  11678.30  0.7222072   30.62398  -600.1663

Significant Features: ZN, CHAS, NOX, RM, DIS, PTRATIO, B, LSTAT

Stepwise Regression

In [36]:
#Stepwise Regression
SWR=regsubsets(MEDV~.,data=boston_dataset,method="seqrep")
Modelsummary=cbind(summary(SWR)$which,R2=summary(SWR)$rsq,SSres=summary(SWR)$rss,AdjR2=summary(SWR)$adjr2,Cp=summary(SWR)$cp,BIC=summary(SWR)$bic)
Modelsummary
A matrix: 8 × 19 of type dbl
Size  Variables selected (plus intercept)         R2         SSres     AdjR2      Cp         BIC
1     LSTAT                                       0.5441463  19472.38  0.5432418  362.75295  -385.0521
2     RM, LSTAT                                   0.6385616  15439.31  0.6371245  185.64743  -496.2582
3     RM, PTRATIO, LSTAT                          0.6786242  13727.99  0.6767036  111.64889  -549.4767
4     RM, DIS, PTRATIO, LSTAT                     0.6903077  13228.91  0.6878351   91.48526  -561.9884
5     NOX, RM, DIS, PTRATIO, LSTAT                0.7080893  12469.34  0.7051702   59.75364  -585.6823
6     CHAS, NOX, RM, DIS, PTRATIO, LSTAT          0.7157742  12141.07  0.7123567   47.17537  -592.9553
7     CHAS, NOX, RM, DIS, PTRATIO, B, LSTAT       0.7221614  11868.24  0.7182560   37.05889  -598.2295
8     ZN, CHAS, NOX, RM, DIS, PTRATIO, B, LSTAT   0.7266079  11678.30  0.7222072   30.62398  -600.1663

Significant Features: ZN, CHAS, NOX, RM, DIS, PTRATIO, B, LSTAT

Model-1:

We will create a model taking the intersection of features suggested by the above three methods.

The glance of data of the suggested features is given below.

In [5]:
boston_dataset <- boston_dataset %>% dplyr::select(c(MEDV,ZN,CHAS,NOX,RM,DIS,PTRATIO,B,LSTAT))
head(boston_dataset)
A tibble: 6 × 9
MEDV  ZN  CHAS  NOX    RM     DIS     PTRATIO  B       LSTAT    (all columns <dbl>)
24.0  18  0     0.538  6.575  4.0900  15.3     396.90  4.98
21.6   0  0     0.469  6.421  4.9671  17.8     396.90  9.14
34.7   0  0     0.469  7.185  4.9671  17.8     392.83  4.03
33.4   0  0     0.458  6.998  6.0622  18.7     394.63  2.94
36.2   0  0     0.458  7.147  6.0622  18.7     396.90  5.33
28.7   0  0     0.458  6.430  6.0622  18.7     394.12  5.21

Modelling:

In [7]:
modelm <- lm(boston_dataset$MEDV ~ ., data =boston_dataset )
summary(modelm)

# RMSE for checking the accuracy of model
cat("RMSE:", sqrt(sum(modelm$residuals^2)/modelm$df))
Call:
lm(formula = boston_dataset$MEDV ~ ., data = boston_dataset)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.6996  -2.7925  -0.5477   1.7005  27.6510 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  30.316950   4.870856   6.224 1.03e-09 ***
ZN            0.037808   0.013298   2.843 0.004652 ** 
CHAS          3.111062   0.870076   3.576 0.000384 ***
NOX         -16.687428   3.228873  -5.168 3.43e-07 ***
RM            4.116082   0.408594  10.074  < 2e-16 ***
DIS          -1.382714   0.187604  -7.370 7.15e-13 ***
PTRATIO      -0.881851   0.115718  -7.621 1.29e-13 ***
B             0.009404   0.002639   3.563 0.000401 ***
LSTAT        -0.543125   0.047652 -11.398  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.847 on 497 degrees of freedom
Multiple R-squared:  0.7266,	Adjusted R-squared:  0.7222 
F-statistic: 165.1 on 8 and 497 DF,  p-value: < 2.2e-16
RMSE: 4.847431

Model-1
RMSE = 4.84
Adjusted R-Squared = 0.722

Adequacy Checking of Model-1

Checking the multi-collinearity in model-1:

In [11]:
vif(modelm)
ZN
2.06728893313098
CHAS
1.04961460183919
NOX
3.0086351625618
RM
1.77129167967903
DIS
3.35388524639144
PTRATIO
1.34883979102906
B
1.24748830397883
LSTAT
2.48863985876114

Inference:

  • Since all values are well under 5, we can conclude that the features are not multicollinear.
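For intuition, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. A standalone sketch on the built-in mtcars data (the notebook itself uses car::vif(modelm)):

```r
# Manual VIF for one predictor: regress it on the other predictors,
# then apply 1 / (1 - R^2).
r2_disp  <- summary(lm(disp ~ cyl + hp + wt, data = mtcars))$r.squared
vif_disp <- 1 / (1 - r2_disp)
vif_disp
```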

Checking Influence Points (Model-1)

In [8]:
## Checking for Influence Points 

inflm.boston <-  influence.measures(modelm)
inf_pts<-which(apply(inflm.boston$is.inf, 1, any))
# which observations 'are' influential

dim(summary(inflm.boston)) # only these
Potentially influential observations of
	 lm(formula = boston_dataset$MEDV ~ ., data = boston_dataset) :

    dfb.1_ dfb.ZN dfb.CHAS dfb.NOX dfb.RM  dfb.DIS dfb.PTRA dfb.B dfb.LSTA
143  0.01   0.00  -0.07    -0.04    0.01   -0.02    0.04    -0.03 -0.03   
145  0.01   0.00  -0.02     0.06   -0.03    0.03   -0.07     0.04  0.03   
146  0.00  -0.01  -0.01     0.02    0.01    0.02   -0.03    -0.01  0.02   
147 -0.01   0.00   0.01    -0.02    0.01   -0.01    0.02     0.01  0.01   
150  0.00   0.00   0.00     0.00    0.00    0.00    0.00     0.00  0.00   
151  0.00   0.00   0.00     0.00    0.00    0.00    0.00     0.00  0.00   
152  0.01   0.00  -0.01     0.02   -0.01    0.01   -0.01     0.00 -0.01   
153 -0.07  -0.01  -0.18    -0.17    0.17   -0.05    0.09     0.00  0.14   
154  0.01   0.00  -0.01     0.04   -0.02    0.02   -0.03    -0.01 -0.02   
155  0.02   0.02  -0.20    -0.19    0.03   -0.10    0.09     0.00  0.04   
156 -0.04   0.03  -0.20    -0.14    0.05   -0.09    0.10     0.16  0.07   
157 -0.02   0.00   0.01    -0.03    0.02   -0.01    0.02     0.03  0.02   
160  0.02   0.01   0.03    -0.12    0.01   -0.04    0.05    -0.02  0.07   
162  0.05  -0.08  -0.10     0.05    0.07   -0.07   -0.16    -0.01 -0.14   
167 -0.03  -0.10  -0.09     0.04    0.18   -0.04   -0.15     0.01 -0.04   
187 -0.10  -0.12  -0.07    -0.08    0.26   -0.06    0.00     0.03  0.01   
209  0.00  -0.01   0.03    -0.01    0.00    0.00    0.00     0.00  0.01   
210  0.03  -0.01   0.13    -0.06   -0.02    0.00   -0.01     0.01  0.06   
211  0.00   0.00  -0.02     0.01    0.00    0.00    0.00     0.00 -0.01   
212  0.02  -0.01   0.10    -0.05   -0.01   -0.01   -0.01     0.01  0.05   
213  0.00   0.00  -0.01     0.00    0.00    0.00    0.00     0.00  0.00   
215  0.10  -0.07   0.01    -0.28    0.05   -0.04   -0.14     0.03  0.39   
222  0.00   0.01  -0.04     0.02   -0.01    0.00    0.01     0.00 -0.03   
226 -0.18  -0.11  -0.06    -0.05    0.36   -0.02    0.01     0.03  0.10   
229 -0.06  -0.11  -0.06    -0.03    0.18   -0.02   -0.02     0.00 -0.02   
234 -0.15  -0.14  -0.06    -0.01    0.29    0.03    0.00     0.01  0.04   
254 -0.41  -0.22  -0.02     0.19    0.41    0.41    0.13     0.04  0.12   
274  0.01   0.00  -0.02     0.00   -0.01    0.00   -0.01     0.00  0.00   
284  0.00   0.16   0.19    -0.05    0.05   -0.07   -0.02    -0.01  0.04   
355 -0.05   0.07   0.01     0.06   -0.03    0.07    0.09     0.00 -0.02   
356 -0.06   0.07   0.01     0.06   -0.03    0.07    0.09     0.00 -0.03   
357  0.03  -0.01  -0.05    -0.03   -0.01   -0.02   -0.02    -0.01  0.00   
358  0.01   0.00  -0.01    -0.01    0.00    0.00   -0.01     0.00  0.00   
359 -0.02   0.00   0.03     0.03    0.00    0.01    0.01     0.01 -0.01   
364  0.02  -0.01  -0.09    -0.05    0.02   -0.02   -0.03    -0.01  0.03   
365  0.62   0.03  -0.52    -0.31   -0.59   -0.16   -0.36    -0.08 -0.01   
366  0.48   0.20  -0.04     0.23   -0.89   -0.20    0.08    -0.06 -0.76   
368  0.55   0.14   0.00    -0.09   -0.65   -0.25    0.00    -0.40 -0.46   
369  0.61   0.30  -0.10     0.14   -1.01_* -0.52    0.24    -0.06 -1.11_* 
370 -0.03   0.11   0.61     0.06   -0.09   -0.18    0.29    -0.01 -0.35   
371 -0.09   0.10   0.52     0.06    0.00   -0.16    0.28     0.03 -0.28   
372  0.05   0.18  -0.10     0.03   -0.15   -0.35    0.24     0.06 -0.36   
373  0.12   0.22   0.87     0.08   -0.37   -0.26    0.33    -0.06 -0.46   
375  0.11   0.14   0.00    -0.16   -0.15   -0.15   -0.06     0.20  0.36   
376  0.21  -0.03   0.06    -0.08   -0.21    0.04   -0.13    -0.12 -0.05   
381  0.21  -0.04   0.05    -0.06   -0.21    0.04   -0.11    -0.14 -0.11   
411 -0.10  -0.02   0.00     0.04    0.07    0.06   -0.01     0.16  0.09   
412  0.00   0.00   0.00    -0.01    0.01   -0.01    0.00    -0.02  0.01   
413  0.42   0.09   0.05    -0.43   -0.20   -0.22   -0.12    -0.54  0.30   
424  0.01   0.00   0.00    -0.02    0.01   -0.01    0.00    -0.05  0.01   
425 -0.04   0.00   0.00     0.03    0.02    0.02    0.00     0.07  0.02   
455  0.00   0.00   0.00     0.00    0.00    0.00    0.00     0.00  0.00   
458  0.01   0.00   0.00     0.01   -0.01    0.01    0.01    -0.05 -0.02   
506 -0.01  -0.04   0.03    -0.02    0.09    0.09   -0.14    -0.04  0.17   
    dffit   cov.r   cook.d hat    
143 -0.11    1.08_*  0.00   0.06_*
145  0.13    1.06_*  0.00   0.04  
146  0.05    1.06_*  0.00   0.04  
147 -0.04    1.06_*  0.00   0.04  
150  0.00    1.05_*  0.00   0.03  
151 -0.01    1.06_*  0.00   0.04  
152  0.03    1.06_*  0.00   0.04  
153 -0.34    1.07_*  0.01   0.07_*
154  0.06    1.05_*  0.00   0.04  
155 -0.33    1.04    0.01   0.05_*
156 -0.34    1.06_*  0.01   0.07_*
157 -0.06    1.08_*  0.00   0.05_*
160 -0.15    1.06_*  0.00   0.04  
162  0.37    0.93_*  0.01   0.02  
167  0.37    0.94_*  0.01   0.02  
187  0.39    0.88_*  0.02   0.02  
209  0.03    1.05_*  0.00   0.03  
210  0.15    1.06_*  0.00   0.05  
211 -0.03    1.06_*  0.00   0.04  
212  0.12    1.06_*  0.00   0.05  
213 -0.01    1.06_*  0.00   0.04  
215  0.48_*  0.91_*  0.03   0.03  
222 -0.05    1.06_*  0.00   0.04  
226  0.42_*  0.97    0.02   0.03  
229  0.29    0.93_*  0.01   0.01  
234  0.37    0.94_*  0.01   0.02  
254  0.58_*  0.95    0.04   0.05  
274 -0.02    1.06_*  0.00   0.04  
284  0.29    1.06_*  0.01   0.06_*
355  0.16    1.07_*  0.00   0.05  
356  0.16    1.07_*  0.00   0.05  
357 -0.07    1.06_*  0.00   0.04  
358 -0.02    1.07_*  0.00   0.04  
359  0.05    1.07_*  0.00   0.05  
364 -0.11    1.06_*  0.00   0.04  
365 -0.95_*  0.89_*  0.10   0.07_*
366  1.00_*  0.92_*  0.11   0.08_*
368  0.76_*  0.94_*  0.06   0.07_*
369  1.33_*  0.56_*  0.18   0.05  
370  0.80_*  0.82_*  0.07   0.04  
371  0.71_*  0.87_*  0.06   0.04  
372  0.60_*  0.60_*  0.04   0.01  
373  1.10_*  0.65_*  0.13   0.04  
375  0.61_*  0.91_*  0.04   0.04  
376 -0.31    0.94_*  0.01   0.02  
381 -0.31    0.92_*  0.01   0.01  
411 -0.19    1.06_*  0.00   0.05  
412  0.02    1.05_*  0.00   0.03  
413  0.85_*  0.85_*  0.08   0.05  
424  0.06    1.05_*  0.00   0.04  
425 -0.08    1.06_*  0.00   0.04  
455  0.01    1.06_*  0.00   0.04  
458  0.06    1.05_*  0.00   0.04  
506 -0.25    0.93_*  0.01   0.01  
  1. 54
  2. 13

Inference:

  • There are 54 potentially influential points in our data.

Since these can affect our model adversely, we check which influence points lie outside the 2-sigma limits and remove them.

Influence points that are outliers:

In [41]:
#boston_dataset<-boston_dataset[-inf_pts,]

count=0
outliers=c()
for(i in inf_pts){
    x <- boston_dataset[i,]
    for(col in names(boston_dataset)){
        mu=mean(boston_dataset[[col]])
        sigma=sd(boston_dataset[[col]])
        if(x[[col]]<mu-2*sigma || x[[col]]>mu+2*sigma)
        {
            count=count+1
        }
    }
    if(count>0){
        outliers <- c(outliers, i)
    }
    count=0
}
length(outliers)
51

Inference:

  • 51 out of 54 influence points lie outside 2-sigma limits.

Removing all 54 points resulted in a better model.

In [9]:
# removing outlier-influence-points
boston_dataset<-boston_dataset[-inf_pts,]
dim(boston_dataset)
  1. 452
  2. 9

Modelling after removing the influential points - Model-1.2:

In [10]:
names(boston_dataset)
modelm <- lm(boston_dataset$MEDV ~ ., data =boston_dataset )
summary(modelm)

cat("RMSE:", sqrt(sum(modelm$residuals^2)/modelm$df))
  1. 'MEDV'
  2. 'ZN'
  3. 'CHAS'
  4. 'NOX'
  5. 'RM'
  6. 'DIS'
  7. 'PTRATIO'
  8. 'B'
  9. 'LSTAT'
Call:
lm(formula = boston_dataset$MEDV ~ ., data = boston_dataset)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.7257  -2.3428  -0.1833   1.8036  10.9265 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  16.020270   4.249095   3.770 0.000185 ***
ZN            0.024037   0.010321   2.329 0.020306 *  
CHAS          1.644239   0.971178   1.693 0.091152 .  
NOX         -13.474252   3.114537  -4.326 1.88e-05 ***
RM            5.500903   0.385304  14.277  < 2e-16 ***
DIS          -0.924084   0.152473  -6.061 2.90e-09 ***
PTRATIO      -0.957900   0.097860  -9.789  < 2e-16 ***
B             0.013213   0.002318   5.701 2.18e-08 ***
LSTAT        -0.398100   0.043608  -9.129  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.591 on 443 degrees of freedom
Multiple R-squared:  0.8198,	Adjusted R-squared:  0.8166 
F-statistic:   252 on 8 and 443 DF,  p-value: < 2.2e-16
RMSE: 3.590891

Model-1.2
RMSE = 3.59
Adjusted R-Squared = 0.81

Thus we got considerable improvement in RMSE from 4.84 to 3.59.

Multicollinearity Check

In [15]:
vif(modelm)
ZN
2.08446670297703
CHAS
1.0607851487314
NOX
3.68666505517973
RM
2.0604832510742
DIS
3.48844990245694
PTRATIO
1.50556094261497
B
1.27168273552251
LSTAT
3.19270486986747

Inference: There is no multicollinearity.

Normality Check

In [21]:
par(mfrow=c(1,3))
options(repr.plot.height=3,repr.plot.width=15, repr.plot.res=150)
powerPlot <- function(x, name) {

dens <- density(x)
# Histogram
hist(x, probability = TRUE, xlab = name, ylab="", ylim= c(0, max(dens$y)), col = "grey",
     axes = FALSE, main = "", cex.lab=2)

# Axis
axis(1, cex.axis=2)

# Density
lines(dens, col = "red", lwd = 1.5)

# Add boxplot
par(new = TRUE)
boxplot(x, horizontal = TRUE, axes = FALSE,
        lwd = 2, col = rgb(0, 1, 1, alpha = 0.15), outcol = "red")  # outcol, not ggplot's outlier.colour
}


# a loop to run powerPlot() for all of the features
for (x in names(boston_dataset)){
    powerPlot( boston_dataset[[x]], x)
}

Inference:

  • The data are not perfectly normal.

Checking Normality of Residuals (Model-1.2):

  1. QQ Plot
  2. Shapiro Test
  3. Anderson-Darling Test
In [22]:
# Test-1. QQ Plot
options(repr.plot.height=8, repr.plot.res=300)
ors <- rstandard(modelm)

qqnorm(ors, pch = 1, frame = FALSE, cex.lab=1.5, cex.axis=1.5, main="Normal QQ Plot of Residuals", cex.main=2)
qqline(ors, col = "steelblue", lwd = 2)
In [50]:
# Test-2
shapiro.test(rstandard(modelm))
	Shapiro-Wilk normality test

data:  rstandard(modelm)
W = 0.9806, p-value = 9.724e-06
In [51]:
# Test-3: Anderson-Darling Test
ad.test(rstandard(modelm))
	Anderson-Darling normality test

data:  rstandard(modelm)
A = 3.1431, p-value = 6.853e-08

Inference: From the above tests we can infer following:

  • The residuals are not normal.

We attribute this result to the fact that the target variable itself is not normally distributed.
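One common remedy in this situation is a Box-Cox transform of the response; MASS::boxcox (the MASS package is loaded above) profiles the likelihood over candidate lambda values. A standalone sketch on the built-in cars data; in this notebook the argument would be modelm:

```r
# Box-Cox sketch: find the lambda that maximizes the profile likelihood.
library(MASS)
fit    <- lm(dist ~ speed, data = cars)
bc     <- boxcox(fit, plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]   # lambda with the highest profile likelihood
lambda
```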

Linear Relation between Residuals and Xs

In [52]:
## linear relation between residuals and X's
par(mfrow=c(3,3))
for (x in names(boston_dataset)){
    plot(boston_dataset[[x]],rstudent(modelm),xlab = x,ylab = "rst(ti)", cex.lab=2, cex.axis=2)}

Inference:

  • The residuals show systematic non-linear patterns against several predictors, so the relationships between the Xs and Y are not purely linear.
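A curved residual pattern can sometimes be absorbed by adding a polynomial term. A standalone sketch on the built-in cars data; in this notebook a natural candidate would be something like poly(LSTAT, 2):

```r
# Compare a linear fit against a quadratic fit via AIC (lower is better).
fit_lin  <- lm(dist ~ speed, data = cars)
fit_quad <- lm(dist ~ poly(speed, 2), data = cars)
c(linear = AIC(fit_lin), quadratic = AIC(fit_quad))
```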

Principal Component Regression Analysis

In [23]:
# GETTING Principal Components
boston_x <- boston_dataset[,c(-1)]

# Scale (but do not center) the data
res.pca <- prcomp(boston_x, retx=TRUE,center = FALSE, scale = TRUE)

Summary of PCAs:

In [24]:
summary(res.pca)
Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6     PC7
Standard deviation     2.3972 0.9854 0.9790 0.4391 0.28491 0.18589 0.09541
Proportion of Variance 0.7183 0.1214 0.1198 0.0241 0.01015 0.00432 0.00114
Cumulative Proportion  0.7183 0.8397 0.9595 0.9836 0.99375 0.99807 0.99920
                           PC8
Standard deviation     0.07975
Proportion of Variance 0.00080
Cumulative Proportion  1.00000

Eigenvalues, or equivalently the variances of the principal components (the same information is available in the above summary):

In [25]:
eigval<- get_eigenvalue(res.pca)
eigval
A data.frame: 8 × 3
       eigenvalue   variance.percent  cumulative.variance.percent
Dim.1  5.746706765  71.83383456        71.83383
Dim.2  0.970951972  12.13689965        83.97073
Dim.3  0.958351714  11.97939643        95.95013
Dim.4  0.192799964   2.40999955        98.36013
Dim.5  0.081171090   1.01463863        99.37477
Dim.6  0.034555535   0.43194419        99.80671
Dim.7  0.009102454   0.11378067        99.92049
Dim.8  0.006360505   0.07950631       100.00000

Scree Plot

In [26]:
options( repr.plot.height=5)
fviz_eig(res.pca, addlabels = TRUE, ylim = c(0,100), ggtheme = theme_minimal(base_size=18)+ theme( plot.title = element_text( hjust=0.5)))

Inference:

  • PC-1 explains the most proportion of variance.
  • The elbow occurs at PC-2.

Below we will look further into PC-1 and PC-2.

Checking contributions of variables to Principal Components

In [27]:
library(factoextra)

options(repr.plot.height=5, repr.plot.res=300)

# Contributions of variables to PCs

fviz_contrib(res.pca, choice = "var", axes = 1, top = 10, ggtheme = theme_minimal(base_size=18)+ theme( plot.title = element_text( hjust=0.5)))
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10, ggtheme = theme_minimal(base_size=18)+ theme( plot.title = element_text( hjust=0.5)))

Inference:

  • Multiple variables contribute substantially to PC-1.
  • PC-2 is dominated by CHAS.

The below chart shows the same thing.

In [140]:
fviz_pca_var(res.pca, col.var = "blue", ggtheme = theme_minimal(base_size=18))
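The PC scores can also serve directly as regressors (principal component regression proper). A standalone sketch on mtcars; in this notebook the scores would come from res.pca$x and the response would be MEDV:

```r
# PCR sketch: regress the response on the first two principal component scores.
pc      <- prcomp(mtcars[, -1], scale. = TRUE)
scores  <- as.data.frame(pc$x[, 1:2])           # first two PC scores
pcr_fit <- lm(mtcars$mpg ~ PC1 + PC2, data = scores)
summary(pcr_fit)$adj.r.squared
```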

Model with the features suggested by PC-1: Model-2

In [19]:
##model with PCA1 suggested features data.

model_pca2 = lm(MEDV~RM+PTRATIO+DIS+LSTAT,data=boston_dataset)
summary(model_pca2)

cat("Adjusted R squared:",summary(model_pca2)$adj.r.squared,"\n")
cat("RMSE:", sqrt(sum(model_pca2$residuals^2)/model_pca2$df))
Call:
lm(formula = MEDV ~ RM + PTRATIO + DIS + LSTAT, data = boston_dataset)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.0506  -2.4811  -0.1807   2.0397  12.8470 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 19.29930    3.74631   5.152 3.88e-07 ***
RM           5.04184    0.39038  12.915  < 2e-16 ***
PTRATIO     -1.11330    0.09904 -11.241  < 2e-16 ***
DIS         -0.36098    0.10625  -3.397 0.000741 ***
LSTAT       -0.55012    0.04024 -13.671  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.842 on 447 degrees of freedom
Multiple R-squared:  0.792,	Adjusted R-squared:  0.7901 
F-statistic: 425.4 on 4 and 447 DF,  p-value: < 2.2e-16
Adjusted R squared: 0.7900978 
RMSE: 3.841527

Model-2
RMSE = 3.84
Adjusted R-squared = 0.79

lm() Model

The lm() model is built on the features suggested by all of the previous methods. The influential outliers have already been removed in the previous section.

Modelling:

In [18]:
lm_model = lm(MEDV~., data=boston_dataset)

summary(lm_model)

cat("Adjusted R squared:",summary(lm_model)$adj.r.squared,"\n")
cat("RMSE:", sqrt(sum(lm_model$residuals^2)/lm_model$df))
Call:
lm(formula = MEDV ~ ., data = boston_dataset)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.7257  -2.3428  -0.1833   1.8036  10.9265 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  16.020270   4.249095   3.770 0.000185 ***
ZN            0.024037   0.010321   2.329 0.020306 *  
CHAS          1.644239   0.971178   1.693 0.091152 .  
NOX         -13.474252   3.114537  -4.326 1.88e-05 ***
RM            5.500903   0.385304  14.277  < 2e-16 ***
DIS          -0.924084   0.152473  -6.061 2.90e-09 ***
PTRATIO      -0.957900   0.097860  -9.789  < 2e-16 ***
B             0.013213   0.002318   5.701 2.18e-08 ***
LSTAT        -0.398100   0.043608  -9.129  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.591 on 443 degrees of freedom
Multiple R-squared:  0.8198,	Adjusted R-squared:  0.8166 
F-statistic:   252 on 8 and 443 DF,  p-value: < 2.2e-16
Adjusted R squared: 0.8165939 
RMSE: 3.590891

Final Model
RMSE = 3.59
Adjusted R-squared = 0.82

These are the best RMSE and adjusted R-squared values among all of our models so far.
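The comparison can also be tabulated. A minimal sketch, with the metric values transcribed from the model summaries in this notebook:

```r
# Collect the metrics reported in this notebook into one table
# (values transcribed from the model summaries above).
model_comparison <- data.frame(
  model    = c("PCA Model-2", "Full lm() model"),
  RMSE     = c(3.84, 3.59),
  adj_r_sq = c(0.79, 0.82)
)

# Pick the model with the lowest RMSE
best <- model_comparison[which.min(model_comparison$RMSE), ]
print(best)
```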

Predicting using the model:

In [28]:
options( repr.plot.height=8)

pred<- predict(lm_model, boston_dataset)
act_vs_pred<- data.frame(actual= boston_dataset$MEDV, predicted= pred)

ggplot(act_vs_pred, aes(x= predicted, y=actual))+
geom_point()+
geom_abline(intercept=0, slope=1)+
theme_minimal(base_size=18) + theme( plot.title = element_text( hjust=0.5)) +
labs(x= 'Predicted Values', y= "Actual Values", title= "Predicted vs Actual Values")

Inference:

  • There is a clear linear relationship between the predicted and observed values.

Checking the normality of residuals:

In [29]:
options(repr.plot.res=300, repr.plot.height=8)

res <- as.data.frame(residuals(lm_model))
ggplot(res,aes(residuals(lm_model)))+
geom_histogram(fill='brown',alpha=0.5, binwidth=1) +
theme_minimal( base_size=18) + theme( plot.title = element_text( hjust=0.5)) + labs(title="Distribution of Residuals")

Inference:

  • The residuals do not follow an exactly normal distribution.
  • However, we treat them as approximately normal, within accepted margins.
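A formal test can complement the histogram here. Below is a minimal, self-contained sketch on stand-in data (not the Boston fit); for the model above, the equivalent call would be `shapiro.test(residuals(lm_model))`, or `ad.test()` from the already-loaded nortest package:

```r
# Self-contained sketch: Shapiro-Wilk test on residuals of a toy fit
# whose errors are normal by construction (stand-in data, not Boston).
set.seed(42)
toy <- data.frame(x = runif(200))
toy$y <- 3 * toy$x + rnorm(200)
fit <- lm(y ~ x, data = toy)

sw <- shapiro.test(residuals(fit))
sw$p.value  # a large p-value gives no evidence against normality
# For the Boston model: shapiro.test(residuals(lm_model))
# or ad.test(residuals(lm_model)) from the nortest package.
```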
In [30]:
options(repr.plot.height=12,repr.plot.res=300)
par(mfrow = c(2,2))
plot(lm_model, cex.lab=1.5, cex.axis=1.5)

CV Model

Partitioning training and test dataset:

In [11]:
suppressMessages(library(caret))
set.seed(998)
trainPartitionRows <- createDataPartition(boston_dataset$MEDV, p = .90, list = FALSE)

cat("Total no of observations = ", nrow(boston_dataset),"\n")
cat("# for training = ", nrow(trainPartitionRows),"\n")
cat("# for testing = ", nrow(boston_dataset) - nrow(trainPartitionRows),"\n")
Total no of observations =  452 
# for training =  409 
# for testing =  43 
In [12]:
trainDataset <- boston_dataset[ trainPartitionRows,]
testDataset  <- boston_dataset[-trainPartitionRows,]

cat("\nHead of TRAIN DATASET:")
head(trainDataset)
cat("\nHead of TEST DATASET:")
head(testDataset)
Head of TRAIN DATASET:
A tibble: 6 × 9
MEDV  ZN    CHAS  NOX    RM     DIS     PTRATIO  B       LSTAT
<dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>   <dbl>    <dbl>   <dbl>
24.0  18.0  0     0.538  6.575  4.0900  15.3     396.90   4.98
21.6   0.0  0     0.469  6.421  4.9671  17.8     396.90   9.14
34.7   0.0  0     0.469  7.185  4.9671  17.8     392.83   4.03
33.4   0.0  0     0.458  6.998  6.0622  18.7     394.63   2.94
22.9  12.5  0     0.524  6.012  5.5605  15.2     395.60  12.43
27.1  12.5  0     0.524  6.172  5.9505  15.2     396.90  19.15
Head of TEST DATASET:
A tibble: 6 × 9
MEDV  ZN    CHAS  NOX    RM     DIS     PTRATIO  B       LSTAT
<dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>   <dbl>    <dbl>   <dbl>
36.2   0    0     0.458  7.147  6.0622  18.7     396.90   5.33
28.7   0    0     0.458  6.430  6.0622  18.7     394.12   5.21
15.2   0    0     0.538  6.142  3.9769  21.0     396.90  18.72
21.0   0    0     0.499  5.850  3.9342  19.2     396.90   8.77
20.0   0    0     0.448  5.786  5.1004  17.9     396.90  14.15
35.4  90    0     0.403  7.249  8.6966  17.9     395.93   4.81

Performing Cross-Validation:

In [13]:
fitControl <- trainControl(## 10-fold CV
                          method = "cv", # repeatedcv
                          number = 10
                          ## repeated ten times
                          #repeats = 10,
                          )

CVModel <- train(MEDV ~ .,
                    data = trainDataset,
                    method = "lm",
                    trControl = fitControl,
                    ## This last option is actually one
                    ## for gbm() that passes through
                    #verbose = TRUE
                    )

CVModel
summary(CVModel)
Linear Regression 

409 samples
  8 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 369, 367, 368, 368, 368, 368, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  3.559482  0.8238594  2.697849

Tuning parameter 'intercept' was held constant at a value of TRUE
Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.8997  -2.2667  -0.1929   1.7162  10.9619 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  14.897890   4.528117   3.290  0.00109 ** 
ZN            0.024623   0.010875   2.264  0.02410 *  
CHAS          1.575802   0.965662   1.632  0.10350    
NOX         -12.097020   3.254296  -3.717  0.00023 ***
RM            5.553978   0.406459  13.664  < 2e-16 ***
DIS          -0.914665   0.158550  -5.769 1.60e-08 ***
PTRATIO      -0.979858   0.104012  -9.421  < 2e-16 ***
B             0.014407   0.002423   5.945 6.03e-09 ***
LSTAT        -0.400045   0.044929  -8.904  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.553 on 400 degrees of freedom
Multiple R-squared:  0.8244,	Adjusted R-squared:  0.8209 
F-statistic: 234.8 on 8 and 400 DF,  p-value: < 2.2e-16

CV Model
RMSE = 3.55
Adjusted R-squared = 0.82
Mean Absolute Error (MAE) = 2.69

Validation on training data for CV-Model:

In [31]:
#validation on training data

pred<- predict(CVModel, trainDataset)
validation<- data.frame(actual= trainDataset$MEDV, predicted= pred)
ggplot(validation, aes(x= predicted, y=actual))+
geom_point()+
geom_abline(intercept=0, slope=1)+
theme_minimal(base_size=18) + theme( plot.title = element_text( hjust=0.5)) +
labs(x= 'Predicted Values', y= "Actual Values", title= "Predicted vs Actual Values")

Inference:

  • There is a linear relationship between the observed and predicted values.

Summary of Prediction:

In [29]:
residuals<- (validation$actual - validation$predicted)^2
RMSE <- sqrt(mean(residuals))
summary(pred)

cat("RMSE on train data:",RMSE)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.444  17.395  21.257  22.204  26.889  45.086 
RMSE on train data: 3.513884

Validation on Test data:

In [35]:
#TESTING

pred1<- predict(CVModel, newdata= testDataset)
test<- data.frame(actual= testDataset$MEDV, predicted= pred1)
ggplot(test, aes(x= predicted, y=actual))+
geom_point()+
geom_abline(intercept=0, slope=1)+
theme_minimal(base_size=18) + theme( plot.title = element_text( hjust=0.5)) +
labs(x= 'Predicted Values', y= "Actual Values", title= "Predicted vs Actual Values")

Summary of validation on test data:

In [32]:
summary(pred1)
residuals1<- (test$actual - test$predicted)^2
RMSE1 <- sqrt(mean(residuals1))
cat("RMSE on test data:",RMSE1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -3.219  17.532  21.150  21.761  27.622  36.225 
RMSE on test data: 3.946302

Inference:
RMSE on test data: 3.946302
RMSE on train data: 3.513884

  • The difference in RMSE between the train and test data is ~0.43, so the model does not appear to overfit.
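The train/test gap can be computed directly. A small sketch, reusing the RMSE values reported above:

```r
# RMSE helper (the notebook computes this inline above)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Gap between the RMSE values reported above
train_rmse <- 3.513884
test_rmse  <- 3.946302
round(test_rmse - train_rmse, 2)  # 0.43
```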

Training performance vs Testing performance

In [163]:
n=nrow(trainDataset)
# create empty data frame 
learnCurve <- data.frame(m=double(n),
                     trainRMSE=double(n),
                     cvRMSE = double(n))

# test data response feature
testY <- testDataset$MEDV

# Run algorithms using 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "RMSE"

# loop over training examples
for (i in 137:nrow(trainDataset)) {
    learnCurve$m[i] <- i
    
    # train learning algorithm with size i
    fit.lm <- train(MEDV~., 
                    data=trainDataset[1:i,], 
                    method="lm", 
                    metric="RMSE",
                    #preProc=c("center", "scale"), 
                    trControl=trainControl)    
    
    learnCurve$trainRMSE[i] <- fit.lm$results$RMSE
    
    # use trained parameters to predict on test data
    prediction <- predict(fit.lm, newdata = testDataset[,-1])
    rmse <- postResample(prediction, testY)
    learnCurve$cvRMSE[i] <- rmse[1]
}
Warning message in predict.lm(modelFit, newdata):
“prediction from a rank-deficient fit may be misleading”
In [164]:
plot(log(learnCurve$trainRMSE),type = "o",col = "red",xlab = "Training set size",
           ylab = "Error (RMSE)", main = "Linear Model Learning Curve", ylim=c(1,1.50), xlim=c(136,410), cex.axis=1.5, cex.lab=1.5)
lines(log(learnCurve$cvRMSE), type = "o", col = "blue")

legend('topright', c("Train error", "Test error"), lty = c(1,1), lwd = c(2.5, 2.5),
      col = c("red", "blue"))

Saving Model

Since the CV model is the best-performing model, we save it as the final model.

In [13]:
final_model <- CVModel
saveRDS(final_model, "./model.rds")
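To reuse the saved model in a later session, readRDS() restores it intact. A self-contained round-trip sketch on a toy fit (the built-in `cars` dataset stands in for the Boston data):

```r
# Fit a toy model, save it, and reload it -- the same pattern used
# for the final CV model above (stand-in data: the built-in `cars` dataset).
toy_fit <- lm(dist ~ speed, data = cars)
path <- tempfile(fileext = ".rds")
saveRDS(toy_fit, path)

reloaded <- readRDS(path)
identical(coef(toy_fit), coef(reloaded))  # TRUE: coefficients survive the round trip
```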
In [204]:
summary(final_model)
Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.8997  -2.2667  -0.1929   1.7162  10.9619 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  14.897890   4.528117   3.290  0.00109 ** 
ZN            0.024623   0.010875   2.264  0.02410 *  
CHAS          1.575802   0.965662   1.632  0.10350    
NOX         -12.097020   3.254296  -3.717  0.00023 ***
RM            5.553978   0.406459  13.664  < 2e-16 ***
DIS          -0.914665   0.158550  -5.769 1.60e-08 ***
PTRATIO      -0.979858   0.104012  -9.421  < 2e-16 ***
B             0.014407   0.002423   5.945 6.03e-09 ***
LSTAT        -0.400045   0.044929  -8.904  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.553 on 400 degrees of freedom
Multiple R-squared:  0.8244,	Adjusted R-squared:  0.8209 
F-statistic: 234.8 on 8 and 400 DF,  p-value: < 2.2e-16

Final Model (CV Model)
RMSE = 3.55
Adjusted R-squared = 0.82
Mean Absolute Error (MAE) = 2.69